Spatial Ecology and Macroecology

Practical - Week 4

Carmen Soria, François Leroy & Gabriel Ortega

(Department of Spatial Sciences)

2022-10-23

What are we going to see today?

A brief introduction to Species Distribution Modelling (SDM)

  1. What are SDMs?
  2. SDM workflow
  3. Resources
  4. SDM approaches
  5. SDM example (Generalized Linear Models)
  6. SDMs limitations

What are SDMs?

SDM overview

  • Tool that aims to predict where species could potentially be located from a limited set of observations.

  • It can also used to estimate a species’ niche from its distribution.

  • It’s a huge field, primarily used in ecology and conservation.

  • It’s also a recent field, that quickly changes and advances, and it’s full of problems and challenges.

SDM workflow

ODMAP protocol (Zurell et al. 2020)

Resources: Tutorials

Resources: Tutorials

SDM approaches

Different SDM algorithms need different types of occurrence data. These are the three main approaches:

  1. Presence-only methods:
    1. Absence data is not necessary.
    2. They can only estimate habitat suitability, not the probability of occurrence.
    3. They are sensitive to spatial biases.
    4. Example:
      1. MaxEnt

SDM approaches

  1. Presence-absence methods:
    1. Use presence/absence data.
    2. Estimate true probability of occurrence.
    3. Sensitive to imperfect detections.
    4. Examples:
      1. Regression-based methods: Generalized Linear models (GLM), Generalized Additive Models (GAM).
      2. Machine-learning methods: Random forest (RT), Boosted Regression Trees (BRT).

SDM approaches

  1. Presence-pseudo absence methods:
    1. Generate artificial absences to allow using presence-absence methods.
      1. The Barbet-Massin et al. (2012) paper is a good starting point on how to generate them.
    2. They inform of the available environmental conditions (also known as background data)
    3. Example: GLM, Random forest (and other presence-absence methods).

SDM approaches

  1. Ensemble methods:

    • If we perform many models, which one do we choose?

    • We could choose the one that’s best suited to our data, or that performs the best.

    • However, a more popular approach is performing ensemble models.

    • In this approach, predictions from multiple models are combined or averaged to produce a single model.

    • The most frequently used package is biomod.

EXERCISE Data preparation for the Iberian lynx (Lynx pardinus)

1. Conceptualisation

  • Taxa: Iberian lynx (Lynx pardinus)
  • Objective: find climatically suitable areas
  • Biodiversity data: GBIF observations
  • Predictors: Climatic predictors
  • Scale: Iberian Peninsula

2. Data preparation

2.1 Getting species observations (Practical 1)

2.2 Checking and cleaning the data

Not all species’ observations are reliable, we always need to check their quality before using them.

We will use the rnaturalearth and ggplot2 packages to view the data and CoordinateCleaner package and the provided GBIF data.

pacman::p_load(tidyverse, rnaturalearth, rnaturalearthdata, sf, raster, ggplot2, CoordinateCleaner)
load("data/lynx_gbif.RData") # Loading the species observations
# Set project crs
proj_crs <- 4326

2.2 Checking and cleaning the data

Let’s inspect the downloaded occurrences, does it make sense?

2.2 Checking and cleaning the data

All records seem to be located in the Iberian Peninsula 👍

We will use the function clean_coordinates() from CoordinateCleaner to test for duplicates (duplicates), outliers (outliers), ensure that the coordinates match the country (centroids), and whether the coordinates are near research institutions (institutions). This process takes time, so we already provide the output.

# Not running - time demanding
# lynx_gbif_clean <- clean_coordinates(lynx_gbif,
#                                        lon="decimalLongitude",
#                                        lat="decimalLatitude",
#                                        countries="countryCode",
#                                        tests=c("centroids", "outliers", 
#                                           "duplicates", "institutions"),
#                                        inst_rad = 1000)

2.2 Checking and cleaning the data

Let’s check the cleaned data set. First we load the cleaned data set.

load(file='data/gbif_lynx_cleaned.RData')

Then we restrict the map to Western Europe and plot the cleaned occurrences on top.

# Bounding box for Western and Central mainland Europe
lon_ext <- c(-10, 20)
lat_ext <- c(35, 55)
# Plot cleaned presences
europe_map <- world_map[which(world_map$continent == "Europe"),]
ggplot(europe_map) +
  geom_sf() +
  coord_sf(xlim = lon_ext, ylim = lat_ext, expand = FALSE) +
  geom_point(data = lynx_gbif_clean, 
             aes(x = decimalLongitude, y = decimalLatitude), color = "red")

2.2 Checking and cleaning the data

Plot of the cleaned presences

2.3 Getting climatic data (Practical 3)

Download the climatic data from WorldClim using the package geodata. We download the data for our area of study (Spain and Portugal). The meaning of each bioclimatic variable code can be found here.

pacman::p_load(geodata)
bioclim_data <- worldclim_country(country = c("Spain", "Portugal"),
                                  var = "bio",
                                  path = "data/")

2.4 Generating pseudo-absences

Different SDM algorithms need different types of occurrence data (presence only and presence-absence). Getting absence data is rare, however, we can artificially generate background or pseudoabsence points. For that, we will use the package dismo.

pacman::p_load(dismo)  # dismo to generate pseudo-absences

First, we prepare the occurrence data.

# Getting only the occurrence coordinates
lynx_gbif_obs <- lynx_gbif_clean[,
    c("decimalLatitude", "decimalLongitude")]

# Transforming it to an sf object
lynx_gbif_sf <- st_as_sf(lynx_gbif_obs, 
                         coords = c("decimalLongitude", "decimalLatitude"), 
                         crs = st_crs(proj_crs))

2.4 Generating pseudo-absences

Then, prepare the input maps. In this case we are only going to consider the Iberian Peninsula. The rnaturalearth at scale = 110 does not include the Canary or Azores Islands (which we are not interested in). Therefore, we crop the WorldClim bioclim data to this area.

# Getting the map of Spain and Portugal
iberia_map <- rnaturalearth::ne_countries(
  scale = 110,
  type = "countries",
  country = c("spain", "portugal"),
  returnclass = "sf"
)

## Getting the extent of our area of study
geographic_extent <- st_bbox(iberia_map)

# Crop bioclim data to desired extent
bioclim_data <- crop(x = bioclim_data, y = geographic_extent)

2.4 Generating pseudo-absences

dismo requires a raster as an input to extract pseudoabsences. We will rasterize the Iberia map using the bioclim data as a base.

# Rasterize the map (dismo works with rasters)
iberia_rasterized <- rasterize(iberia_map, bioclim_data)
iberia_raster <- raster::raster(iberia_rasterized)

We randomly generate pseudo-absences with the function randomPoints. This function avoids generating pseudo-absences where there are already presences.

# Get pseudo-absences
set.seed(1254)
random_points <- dismo::randomPoints(iberia_raster, n = 500, p = as_Spatial(lynx_gbif_sf$geometry))
random_points_sf <- st_as_sf(data.frame(x = random_points[, 1], y = random_points[, 2]), 
                             coords = c("x", "y"), 
                             crs = st_crs(proj_crs))

2.4 Generating pseudo-absences

Plotting the presences and pseudo-absences

2.4 Generating pseudo-absences

Adding pseudo-absences to our presence dataframe

# Prepare the presences data to contain a column indicating 1 for presence
lynx_gbif_obs <- data.frame(lynx_gbif_obs, PA=1)
# Adding that same column to the pseudoabsence data
lynx_pseudoabs <- data.frame(random_points, PA = 0) 
names(lynx_pseudoabs) <- c('decimalLongitude','decimalLatitude', 'PA')
# Binding those data frames
lynx_gbif_obs <- rbind(lynx_gbif_obs, lynx_pseudoabs)

2.5 Extracting climate data for our presences

Here we extract the climatic data for each of our observations using the function extract() from the package terra.

pacman::p_load(terra)
## Adding climate data
lynx_obs_clim <- cbind(lynx_gbif_obs, 
                       terra::extract(x = bioclim_data, 
                                      y = lynx_gbif_obs[,c('decimalLongitude','decimalLatitude')], 
                                      cells=T) )

2.5 Adding climate data

Explore the climate data for our presence and pseudoabsence data set.

summary(lynx_obs_clim)
 decimalLatitude decimalLongitude       PA               ID        
 Min.   :36.25   Min.   :-9.421   Min.   :0.0000   Min.   :   1.0  
 1st Qu.:37.61   1st Qu.:-8.616   1st Qu.:0.0000   1st Qu.: 427.8  
 Median :38.07   Median :-6.935   Median :1.0000   Median : 854.5  
 Mean   :38.69   Mean   :-6.212   Mean   :0.7073   Mean   : 854.5  
 3rd Qu.:39.47   3rd Qu.:-4.220   3rd Qu.:1.0000   3rd Qu.:1281.2  
 Max.   :43.64   Max.   : 2.788   Max.   :1.0000   Max.   :1708.0  
                                                                   
 wc2.1_30s_bio_1  wc2.1_30s_bio_2  wc2.1_30s_bio_3 wc2.1_30s_bio_4
 Min.   : 3.683   Min.   : 5.292   Min.   :31.86   Min.   :276.4  
 1st Qu.:14.365   1st Qu.: 9.306   1st Qu.:39.35   1st Qu.:410.8  
 Median :15.854   Median :10.542   Median :41.74   Median :555.4  
 Mean   :15.068   Mean   :10.684   Mean   :41.66   Mean   :538.0  
 3rd Qu.:16.258   3rd Qu.:12.108   3rd Qu.:44.48   3rd Qu.:650.0  
 Max.   :18.396   Max.   :14.783   Max.   :47.77   Max.   :769.1  
 NA's   :36       NA's   :36       NA's   :36      NA's   :36     
 wc2.1_30s_bio_5 wc2.1_30s_bio_6  wc2.1_30s_bio_7 wc2.1_30s_bio_8 
 Min.   :20.20   Min.   :-7.900   Min.   :13.3    Min.   :-1.767  
 1st Qu.:27.20   1st Qu.: 1.500   1st Qu.:21.4    1st Qu.: 9.400  
 Median :29.90   Median : 4.700   Median :26.5    Median :11.408  
 Mean   :29.92   Mean   : 4.018   Mean   :25.9    Mean   :10.745  
 3rd Qu.:32.80   3rd Qu.: 6.700   3rd Qu.:30.4    3rd Qu.:12.417  
 Max.   :36.80   Max.   : 9.500   Max.   :36.4    Max.   :18.033  
 NA's   :36      NA's   :36       NA's   :36      NA's   :36      
 wc2.1_30s_bio_9  wc2.1_30s_bio_10 wc2.1_30s_bio_11 wc2.1_30s_bio_12
 Min.   : 1.333   Min.   :12.13    Min.   :-3.217   Min.   : 213.0  
 1st Qu.:20.367   1st Qu.:20.72    1st Qu.: 6.912   1st Qu.: 494.8  
 Median :21.833   Median :22.13    Median : 9.542   Median : 548.0  
 Mean   :21.393   Mean   :22.05    Mean   : 8.936   Mean   : 587.3  
 3rd Qu.:23.587   3rd Qu.:23.77    3rd Qu.:11.217   3rd Qu.: 599.0  
 Max.   :26.433   Max.   :26.43    Max.   :13.350   Max.   :1786.0  
 NA's   :36       NA's   :36       NA's   :36       NA's   :36      
 wc2.1_30s_bio_13 wc2.1_30s_bio_14 wc2.1_30s_bio_15 wc2.1_30s_bio_16
 Min.   : 30.00   Min.   : 0.000   Min.   :16.32    Min.   : 84.0   
 1st Qu.: 70.75   1st Qu.: 2.000   1st Qu.:47.58    1st Qu.:188.0   
 Median : 92.00   Median : 4.000   Median :59.65    Median :250.0   
 Mean   : 90.59   Mean   : 7.928   Mean   :56.25    Mean   :246.1   
 3rd Qu.:102.00   3rd Qu.: 8.000   3rd Qu.:67.72    3rd Qu.:273.0   
 Max.   :266.00   Max.   :74.000   Max.   :77.06    Max.   :759.0   
 NA's   :36       NA's   :36       NA's   :36       NA's   :36      
 wc2.1_30s_bio_17 wc2.1_30s_bio_18 wc2.1_30s_bio_19      cell        
 Min.   : 13      Min.   : 17.00   Min.   : 65.0    Min.   :  19831  
 1st Qu.: 21      1st Qu.: 26.00   1st Qu.:171.0    1st Qu.: 774255  
 Median : 29      Median : 32.00   Median :229.0    Median :1027238  
 Mean   : 41      Mean   : 46.63   Mean   :226.5    Mean   : 915789  
 3rd Qu.: 41      3rd Qu.: 47.00   3rd Qu.:255.0    3rd Qu.:1111500  
 Max.   :259      Max.   :260.00   Max.   :759.0    Max.   :1356132  
 NA's   :36       NA's   :36       NA's   :36                        

2.5 Spatial thinning

We perform a process called spatial thinning to keep only one presence or pseudoabsence point per raster climate cell. We will be using the spThin package.

pacman::p_load(spThin)
# spThin requires longitude and latitude coordinates, that we already have.
# It also requires a column with the species name
lynx_obs_clim$species <- 'Lynx pardinus'
  
# Remove adjacent cells of presence/background data:
xy <- thin(lynx_obs_clim, lat.col='decimalLatitude',long.col='decimalLongitude',
           spec.col='species', thin.par=30, reps=1, write.files=F,
           locs.thinned.list.return=T)

2.5 Spatial thinning

We select the maximum number of records after thinning (stored in the first object of the list of xy).

# Keep the maximum number of records after thinning
xy_keep <- xy[[1]]

We thin our data set, extracting the cell numbers for the thinned coordinates and using them to subset our data frame of presence and pseudoabsences.

cells_thinned <- terra::cellFromXY(iberia_raster, xy_keep)
lynx_thinned <- lynx_obs_clim[lynx_obs_clim$cell %in% cells_thinned,]
lynx_thinned <- na.omit(lynx_thinned)

2.5 Spatial thinning

Examine spatially the thinned dataset (presences in blue and pseudo absences in red)

2.6 Testing for collinearity in the data

Before running our models, we should test for correlation in our variables and select the ones that are not correlated and that are more relevant to our taxa. We will perform the Spearman correlation test, as we do not know if the data values are normally distributed.

# Running the correlation analysis for the climatic variables
cor_matrix <- cor(lynx_thinned[,-c(1:4, 24,25)], method='spearman')
# Set a correlation threshold (adjust as needed)
cor_threshold <- 0.7

# Perform hierarchical clustering to examine the hierarchical correlation between variables
dendrogram <- hclust(as.dist(1 - cor_matrix))

We plot the dendrogram

plot(dendrogram, main = "Dendrogram with Correlation Threshold")

# Add a horizontal line to mark the correlation threshold
abline(h = 1 - cor_threshold, col = "red", lty = 2)

2.6 Testing for collinearity in the data

# Plot the dendrogram
plot(dendrogram, main = "Dendrogram with Correlation Threshold")

# Add a horizontal line to mark the correlation threshold
abline(h = 1 - cor_threshold, col = "red", lty = 2)

2.6 Testing for collinearity in the data

Which variables should we select from each correlated cluster? Knowledge on the taxa is essential

2.6 Testing for collinearity in the data

In this case, we will keep it simple and select only Annual Mean Temperature (bio 1) and Annual Precipitation (bio 12).

# Selecting the most relevant and uncorrelated variables
my_preds <- c( "wc2.1_30s_bio_1", # Annual mean temperature
               "wc2.1_30s_bio_12") #Annual precipitation

2.7 Preparing the data and separating the data into training and testing

# Filtering the relevant columns of the data
lynx_data <- lynx_thinned %>% 
  dplyr::select(species, PA, decimalLongitude, decimalLatitude, any_of(my_preds))

# Select randomly training data into 70% and 30%
set.seed(1235)
train_i <- sample(seq_len(nrow(lynx_data)), size=round(0.7*nrow(lynx_data)))

# Subset the training and testing data
lynx_train <- lynx_data[train_i,]
lynx_test <- lynx_data[-train_i,]

We have the data ready to go! 🎉

EXERCISE Modelling (Lynx pardinus) using GLM

1. Presence-absence methods - GLMs

Fit the GLM

m_glm <- glm(PA ~ wc2.1_30s_bio_1 +
               wc2.1_30s_bio_12,
             family = "binomial",
             data = lynx_train)

1. Presence-absence methods - GLMs

Explore the GLM output

summary(m_glm)

Call:
glm(formula = PA ~ wc2.1_30s_bio_1 + wc2.1_30s_bio_12, family = "binomial", 
    data = lynx_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.4631  -0.9091  -0.4536   1.0240   2.5428  

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -6.4418710  1.6101371  -4.001 6.31e-05 ***
wc2.1_30s_bio_1   0.4409649  0.0970562   4.543 5.54e-06 ***
wc2.1_30s_bio_12 -0.0013064  0.0008209  -1.592    0.111    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 236.68  on 193  degrees of freedom
Residual deviance: 197.06  on 191  degrees of freedom
AIC: 203.06

Number of Fisher Scoring iterations: 5

1. Presence-absence methods - GLMs

Habitat suitability prediction

We can use the output of our model to predict the habitat suitability across the entire area of study. To achieve this, we use the function predict, passing the data of the climatic variables considered and our model. The argument type response will return the predicted probabilities from our model.

glm_predict <- predict(bioclim_data[[my_preds]], m_glm, type = "response")
plot(glm_predict)

1. Presence-absence methods - GLMs

Habitat suitability prediction

This map shows the probability of occurrence of our species across the study area. This value ranges from 0 (unsuitable) to 1 (suitable).

1. Presence-absence methods - GLMs

Model evaluation and habitat suitability prediction

Model evaluation and habitat prediction have been calculated using Jeff Oliver’s tutorial.

To obtain a binary map (suitable - unsuitable), we evaluate that model using the observations and pseudo-absences reserved for testing, using the function pa_evaluate() from the predicts package.

pacman::p_load(predicts)
glm_eval <- pa_evaluate(p = lynx_test[lynx_test$PA == 1, ], # presence data
                        a = lynx_test[lynx_test$PA == 0, ], # pseudoabsence data
                        model = m_glm, # model we are evaluating
                        type = "response")

Model evaluation

There are multiple metrics to evaluate SDMs. The AUC is a commonly used metric, although it has its limitations. It indicates how good is the model at distinguishing classes (in our case, presence from absence) and ranges from -1 to 1.

An AUC of 1 indicates that our model is able to perfectly discriminate between areas where the species is present from where it is absent. An AUC of -1 is the complete opposite, and an AUC of 0.5 means that our model is no better than random.

  np na prevalence       auc      cor         pcor      ODP
1 21 62   0.253012 0.8049155 0.447308 2.239462e-05 0.746988

What is the AUC of our model? What does this mean?

1. Presence-absence methods - GLMs

Model prediction

To do this, we need to identify the suitability threshold using the thresholds element from the function pa_evaluate() We chose max_spec_sense, which sets “the threshold at which the sum of the sensitivity (true positive rate) and specificity (true negative rate) is highest.”

# Determine minimum threshold for "presence"
glm_threshold <- glm_eval@thresholds$max_spec_sens
print(glm_threshold)
[1] 0.3193347

1. Presence-absence methods - GLMs

Binary map of predicted presence

This map is a categorical classification of whether the cell is suitable or not for the Iberian lynx using our selected threshold.

Any questions?